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Abstract 

The  present  paper  proposes  an  adaptive  biasing  potential  technique  for  the  compu¬ 
tation  of  free  energy  landscapes.  It  is  motivated  by  statistical  learning  arguments 
and  unifies  the  tasks  of  biasing  the  molecular  dynamics  to  escape  free  energy  wells 
and  estimating  the  free  energy  function,  under  the  same  objective  of  minimizing 
the  Kullback-Leibler  divergence  between  appropriately  selected  densities.  It  offers 
rigorous  convergence  diagnostics  even  though  history  dependent,  non-Mar kovian 
dynamics  are  employed.  It  makes  use  of  a  greedy  optimization  scheme  in  order  to 
obtain  sparse  representations  of  the  free  energy  function  which  can  be  particularly 
useful  in  multidimensional  cases.  It  employs  embarrassingly  parallelizable  sampling 
schemes  that  are  based  on  adaptive  Sequential  Monte  Carlo  and  can  be  readily 
coupled  with  legacy  molecular  dynamics  simulators.  The  sequential  nature  of  the 
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1  Introduction 

Free  energy  is  a  central  concept  in  thermodynamics  and  in  the  study  of  several 
systems  in  biology,  chemistry  and  physics  [11].  It  represents  a  rigorous  way  to 
coarse-grain  systems  consisting  of  very  large  numbers  of  atomistic  degrees  of 
freedom,  to  probe  states  not  accessible  experimentally,  to  characterize  global 
changes  as  well  as  investigate  relative  stabilities.  In  most  applications,  a  brute- 
force  computation  based  on  sampling  the  atomistic  positions  is  impractical  or 
infeasible  as  the  free  energy  barriers  to  overcome  are  so  large  that  the  system 
remains  trapped  in  metastable  free  energy  sets  [56,61,11,77]. 

Equilibrium  techniques  for  computing  free  energy  surfaces  such  as  Thermo¬ 
dynamic  Integration  [35],  Weighted  Histogram  Analysis  Method  (WHAM, 
[39,67]),  Adaptive  Integration  [71,26],  Multistate  Bennett  Acceptance  Ratio 
(MBAR,  [66])  require  the  simulation  of  very  long  atomistic  trajectories  in  or¬ 
der  to  achieve  equilibrium.  Furthermore,  sampling  along  these  paths  correctly 
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might  necessitate  advanced  and  quite  involved  techniques  [13].  Techniques 
based  on  non-equilibrium  path  sampling  [31,32,28,30]  lack  adaptivity  and  re¬ 
quire  the  user  to  specify  a  particular  path  on  the  reaction  coordinate  space 
connecting  two  energetically  important  free  energy  regions,  which  can  be  non¬ 
trivial  a  task.  More  recently  proposed  adaptive  biasing  potential  [3,76,41,2,24] 
and  adaptive  biasing  force  [17,16,29]  techniques  are  capable  of  dynamically 
utilizing  information  obtained  from  the  atomistic  trajectories  to  bias  the  cur¬ 
rent  dynamics  in  order  to  facilitate  the  escape  from  metastable  sets  [43] .  They 
are  able  to  automatically  discover  important  regions  of  the  reaction  coordi¬ 
nate  space.  Since  they  rely  on  history-dependent,  non-Markovian  dynamics,  it 
is  not  a  priori  clear,  and  in  which  sense,  the  system  reaches  a  stationary  state. 
Some  work  along  these  lines  has  been  done  in  the  context  of  the  adaptive 
biasing  force  method  in  [44],  for  Langevin-type  systems  in  [6]  and  in  [53,43]. 

We  propose  an  adaptive  biasing  potential  technique  where  the  two  tasks  of 
biasing  the  dynamics  and  estimating  the  free  energy  landscape  are  unified 
under  the  same  objective  of  minimizing  the  Kullback-Leibler  divergence  be¬ 
tween  appropriately  selected  distributions  on  the  extended  space  that  includes 
atomic  coordinates  and  the  collective  variables  [51,52],  This  framework  pro¬ 
vides  a  natural  way  for  selecting  the  basis  functions  used  in  the  approximation 
of  the  free  energy  and  obtaining  sparse  representations  which  is  critical  when 
multi-dimensional  collective  variables  are  used.  It  allows  the  analyst  to  utilize 
and  correct  any  prior  information  on  the  free  energy  landscape  and  provides 
an  efficient  manner  of  obtaining  good  estimates  at  various  temperatures.  The 
scheme  proposed  is  embarrassingly  parallclizable  and  relies  on  adaptive  Se¬ 
quential  Monte  Carlo  procedures  which  enable  efficient  sampling  from  the 
high-dimensional  and  potentially  multi-modal  distributions  of  interest. 
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The  structure  of  the  rest  of  the  paper  is  as  follows.  In  the  beginning  of  Section 
2  we  motivate  our  method  for  the  alchemical  case  arriving  at  the  identification 
of  three  (interconnected)  problems:  the  selection  of  a  parametrization  of  the 
free  energy  function,  the  choice  of  a  distance  metric  in  the  space  of  probabil¬ 
ity  densities  and  an  optimization  scheme  to  minimize  this  distance  between 
two  appropriately  selected  densities.  Section  2.1  discusses  the  suitability  and 
advantages  of  the  Kullback-Leibler  divergence.  Section  2.2  deals  with  the  opti¬ 
mization  strategy  employed  and  the  use  of  a  stochastic  approximation  scheme 
that  guarantees  convergence  under  weak  conditions.  Section  2.3  discusses  a 
suboptimal  strategy  for  the  successive  resolution  of  the  free  energy  landscape 
by  progressive  addition  of  basis  functions.  Section  2.4  is  concerned  with  an 
adaptive  Sequential  Monte  Carlo  scheme  for  the  estimation  of  the  expectations 
involved.  Section  3  demonstrates  the  advantages  of  the  proposed  method  for 
two  important  extensions:  the  reaction  coordinate  case,  and  the  calculation 
of  the  free  energy  function  as  a  function  of  temperature.  Finally  Section  4 
contains  results  that  illustrate  the  capabilities  of  the  method  with  numerical 
results  from  three  test  cases. 


2  Methodology  -  A  statistical  learning  approach  for  adaptively  cal¬ 
culating  free  energies 

For  clarity  of  the  presentation,  we  will  first  introduce  our  method  for  the 
so-called  alchemical  case  and  generalize  it  later  for  the  reaction  coordinate 
case.  Consider  a  molecular  system  with  (generalized)  coordinates  q  £  A4  C 
Mn  following  a  Boltzmann-like  distribution  which  in  turn  depends  on  some 
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parameters  z  e  T>  C  (in  general  d  <<  n) : 


p(q|z)  oc  exp  (—(3V (q;  z)) , 


(1) 


where  1/ (q;  z)  is  the  potential  energy  of  the  system  and  (3  plays  the  role  of  in¬ 
verse  temperature.  The  free  energy  A( z)  is  defined,  up  to  an  additive  constant, 
as: 

A(Z)  =  -/r1  log  [  exp  (~(3V(q-  z))  dq.  (2) 

In  general,  the  paperameters  z  provide  a  coarse-grained  description  of  the 
molecular  system  and  A(z)  concisely  summarizes  the  system’s  behavior  with 
respect  to  those  variables.  Our  goal  is  to  compute  the  free  energy  function 
A( z)  over  the  whole  domain  T>. 


Let  A(z;  6)  be  an  estimate  of  A( z)  parametrized  by  0  E  ©  C  This 
parametrization  will  be  made  precise  in  the  sequel.  We  define  a  joint  prob¬ 
ability  distribution  on  the  (generalized)  coordinates  q  and  the  parameters  z 
as: 

P(q,z  I  9)  =  (3) 

where  lx>(z)  is  the  indicator  function  on  T>  and  Z(9)  is  the  normalization 
constant,  i.e.: 

Z(0)  =  /  e-/3(y(q’z)-i(z;e))dqdz.  (4) 

JvxM 

It  is  easy  to  verify  that  the  marginal  density  of  the  parameters  z  e  T>  is  given 
by: 


p(z  I  e)  =  ImP(<Iz  I  6)  d(l 

:lI,(z)e-/3(A(z)-A(z;0)). 


(5) 


i 

Z(0)- 


The  key  property  of  p(z  \  G)  is  that  it  reduces  to  the  uniform  distribution  on 
T>  if  and  only  if  the  free  energy  estimate  is  exact  (up  to  an  additive  constant), 
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i.e.  A(z;  6)  =  A{ z),  z  e  V.  As  a  result  a  natural  strategy  to  estimate  A(z)  is 
by  minimizing  a  distance  metric  between  p(z  \  Q)  and  the  uniform  distribution 
over  V. 

To  make  things  mathematically  precise,  let  V  denote  the  the  set  of  all  proba¬ 
bility  densities  on  T>  with  respect  to  z,  and: 

tt(z)  =  l©(z)  (6) 

the  uniform  density  on  T>  (  whose  volume  is  denoted  by  |  T>  |).  Our  approach 
to  the  free  energy  estimation  problem  consists  of  three  key  ingredients: 

(1)  the  selection  of  a  parameterization  for  A( z;  9), 

(2)  the  choice  of  a  distance  metric  d:?x?4l 

(3)  a  procedure  for  the  solution  of  the  minimization  problem 

6*  =  argmind  (7r(z),p(z|0)) . 

With  regards  to  the  first  ingredient,  we  adopt  representations  inspired  by 
kernel  regression  expansions.  Kernel  regression  models  have  been  proven  suc¬ 
cessful  for  functional  approximations  in  high-dimensional  cases  where  d  is  in 
the  order  of  10  or  100  [72,73].  The  unknown  function  is  selected  from  a  Repro¬ 
ducing  Kernel  Hilbert  Space  (RKHS)  Hk  induced  by  a  positive,  semi-definite 
kernel  In  particular,  the  approximate  free  energy  function  A(z\0)  is 

expressed  as: 

M^e)  =  J26iK(zizATi)  ='■  Y,eiKAz)i  Z^V  (7) 

3= 1  i=1 

where  z3  are  points  in  T>  and  t3  kernel  parameters  whose  role  is  described  in 
the  sequel.  In  order  to  fix  the  additive  constant,  we  select  a  point  z0  e  V  such 
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that  A( z0;  0)  =  0  1 .  In  relevant  literature  different  types  of  kernel  functions 
have  been  used  such  as  thin  plate  splines,  multiquadrics  or  Gaussians.  While 
all  these  functions  can  be  employed  in  the  framework  presented,  we  focus  our 
presentation  on  Gaussian  kernels  which  also  have  an  intuitive  parametrization 
with  regards  to  the  scale  of  variability  of  A  as  quantified  by  the  bandwidth 
parameters  Tj  =  {Tjj}f=1  in  each  dimension: 

d 

Kj{ z)  =  K(z,zj-Tj)  =  exp {~Y,rj,i(zi  ~  (8) 

i=i 

Gaussian  kernels  in  the  context  of  free  energy  approximations  have  also  been 
used  in  [41,51,24],  In  Section  2.3,  we  discuss  a  way  of  adaptively  determining 
the  cardinality  of  the  expansion  k,  as  well  as  the  precise  form  of  Kj  (i.e..  z j 
and  Tj). 


With  regards  to  second  ingredient,  we  employ  the  Kullback-Leibler  divergence 
as  a  measure  of  distance  between  probability  distributions.  As  discussed  in 
Section  2.1,  this  choice  possesses  computational  and  theoretical  advantages 
and  leads  to  a  convex  optimization  problem.  Finally  with  regards  to  the  third 
ingredient,  we  propose  a  stochastic  gradient  descent  scheme  as  discussed  in 
detail  in  Section  2.2.  The  latter  involves  a  stochastic  approximation  scheme 
and  a  sampler  for  the  estimation  of  expectations  involved  which  is  discussed 
in  Section  2.4.  The  algorithmic  steps  are  summarized  in  Algorithm  2. 


1  This  is  always  possible  by  changing  the  kernels  in  Equation  (7)  to  Kj(z,Zj)  = 
Kj{z)  -  Ay(z0,Zj). 
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2. 1  Choice  of  the  metric 


We  propose  employing  the  Kullback-Leibler  (KL)  divergence  KL(7t(z)  ||  p( z 
0))  [14]: 


KL(tt  II  p)  =  /  7r(z)log 
Jv 


7T  Z 


dz. 


(9) 


1  7>(Z  I  0) 

It  is  always  non-negative  and  becomes  zero  if  and  only  if  7r(z)  =  p( z  |  G )  or 
equivalently  A(z;  6)  =  A(z),  z  G  T>  2  .  Despite  the  fact  that  it  is  not  a  metric 
in  the  mathematical  sense,  it  is  frequently  used  as  a  measure  of  the  distance 
between  two  probability  distributions.  Furthermore  the  KL-divergence  pro¬ 
vides  upper  bounds  to  other  commonly  used  metrics  such  as  the  Hellinger 
distance  defined  by: 

2  \  1/2 

dri 

as  well  as  the  total  variation  distance:  3 


II  P)  =  (jv  (Vp(z|0)  -  \AKz))  dz)  , 


V ("7T  ||  p)  =  Slip 
Be&(v) 


IB 


(p(z\0)  -  n(z))  dz 


Le  Cam’s  inequalities  as  well  as  Lemma  2.4  of  [74]  imply  that: 


V2(7T  ||  p)  <  H2(7T  ||  p)  <  KL( 7T  ||  p). 


(10) 


Hence  an  minimization  of  the  KL-divergence  provides  good  approximations  of 
the  free  energy  surface  with  respect  to  these  two  genuine  distances  as  well. 


Since: 


KL(vr  ||  p)  —  —  log  |  X>  |  —  /  7r(z)  logp(z  |  9)dz, 


2  As  already  mentioned,  of  interest  are  free-energy  differences  and  therefore  per¬ 
turbations  of  A( z)  or  A(z;  6)  by  a  constant  are  ignored. 

3  We  denote  here  with  B{V)  the  set  of  the  Borel  sets  of  Wl  that  are  subsets  of  T>, 
i.e.  the  set  of  all  Lebesgue  measurable  subsets  of  V. 


the  aforementioned  formulation  offers  a  clear  strategy  for  estimating  the  free 
energy  by  minimizing  the  following  form  with  respect  to  0. 

1(0)  =  —  j  7r(z)  log p(z  |  0)dz.  (11) 

The  KL-divergence  is  always  non-negative,  so  the  objective  function  1(0 )  has 
a  lower  bound: 

1(0)  >  log  |  V  |  .  (12) 

This  lower  bound  can  be  readily  calculated  and  used  to  monitor  convergence 
as  well  as  the  quality  of  the  approximation  obtained. 

Notice  that  1(0)  depends  on  the  unknown  free  energy  A( z)  explicitly  (from 
Equation  (5)): 


1(0)  =  —f  7r(z)logp(z  |  0)dz 


(13) 


=  PS’ rW  (-4(z)  -  Mz'-0))  iz  +  log Z(e). 

However,  its  gradient  J(0 )  =  '-ppp-  depends  on  A( z)  only  through  an  expec¬ 
tation  over  p(z\0).  In  particular,  by  differentiating  Equation  (4)  we  obtain: 


and  thus: 


<9  log  Z(0) 
ddi 


1  8Z(0) 
Z(0)  09, 


/3-Ep(z|0)[Aj(z)] 


J,(0) 


am 

aej 


(14) 


/d-E^qz) 


aA  ,  aiogz 

ddj  \  “T  ddj 


(15) 


=  P  (ep (,|»)  [Kj(z)\  -  [Kj( z)])  , 

where  -Ep(z|0)[-]  and  z)[-]  imply  an  expectation  with  regards  to  p(z\0)  and 
7r(z)  respectively.  Given  the  unavailability  of  p(z  \  0)  (since  it  depends  on  the 
unknown  free  energy  A( z)),  the  expectations  above  can  only  be  computed  by 


9 


Monte  Carlo  sampling  in  the  joint  space  M.  xT>  with  respect  to  the  joint  den¬ 
sity  p(q,  z  |  6)  (Equation  (3))  that  is  known  up  to  the  normalization  constant. 
The  algorithmic  scheme  employed  for  the  computation  of  these  expectations 
is  discussed  in  detail  in  Sections  2.2  and  2.4. 

Finally,  it  is  important  to  note  that  the  Hessian  of  the  objective  function  9qqqqt 
is  proportional  to  the  covariance  between  the  kernels  i.e.: 


d2i  _  a\ogZ{e) 

ddjdei  dejdel 


1  az  oz  _i _ 1  a2z 

Z2(6)  dOj  ddi  '  Z{0)  dOjddi  ^ 


ft  -Ep(z|0) 


(Aj(z)  -  SpMe)[A»])(ir,(z)  -  BpMS)[A,(z)]) 


—  I32  CoVp^ifl,  [Kj ,  K[\ . 

As  long  as  the  covariance  matrix  is  positive  definite,  the  objective  function  is 
convex  with  respect  to  6  and  there  is  a  unique  minimum.  For  the  Gaussian 
kernels  employed  (Equation  (8))  which  have  infinite  support,  this  condition  is 
satisfied  except  for  degenerate  cases  of  p( z  |  6).  The  formulation  presented  can 
also  be  interpreted  by  using  arguments  based  on  the  well-known  Expectation- 
Maximization  algorithm  (EM,  [23])  as  discussed  in  appendix  B  where  potential 
Bayesian  extensions  are  also  presented. 


2.2  Optimization  with  noisy  gradients 


We  propose  employing  a  gradient  descent  scheme  in  order  to  determine  6.  It  is 
noted  that  for  similar  computational  problems  as  they  appear  for  example  in 
the  context  of  maximum  entropy  estimation,  more  involved  procedures  such 
as  Improved  Iterative  Scaling  [4,22]  and  noisy  conjugate  gradients  [65]  have 


10 


been  be  employed.  Second-order  (quasi-)Newton-Raphson  techniques  are  also 
possible  although  the  unavoidable  Monte  Carlo  noise  in  the  computation  of 
the  Hessian  (i.e.  the  covariance  in  Equation  (16))  can  destroy  its  positive 
definiteness. 

Let  9k  denote  the  vector  of  kernel  amplitudes  (Equation  (7))  when  k  such 
kernels  are  used.  Let  also  0^  denote  the  estimate  of  9k  after  m  iterations  of 
the  gradient  descent  algorithm.  Then  at  the  (m  +  1)— iteration,  the  following 
update  equation  could  be  used: 

C+1  =  0km  -  A  J(0  (17) 

where  A  >  0  is  the  learning  rate. 

In  general  exact  calculation  of  the  gradient  J(0)  (Equation  (15))  is  impossible 
and  one  must  resort  to  noisy,  Monte  Carlo  estimates.  The  noise  can  impede 
convergence  or  even  lead  to  a  divergent  scheme.  For  that  purpose  we  propose 
employing  a  stochastic  approximation  variant  of  the  Robbins  &  Monro  scheme 
[62,9]  in  combination  with  an  adaptive  Sequential  Monte  Carlo  sampler.  The 
former  ensures  convergence  with  finite  sample  size  and  does  not  necessitate 
equilibrium  samples  from  p( z  j  6).  The  latter  produces  estimators  with  lower 
variance  as  compared  to  standard  Markov  Chain  Monte  Carlo  schemes  and 
leads  to  accelerated  convergence.  It  is  noted  that  stochastic  approximation 
schemes  in  the  context  of  free  energy  estimation  have  previously  been  used  in 
[46,47], 

If  J(0£)  denotes  the  Monte  Carlo  estimate  of  the  gradient  obtained  with  a 
finite  sample  size  (the  details  of  this  estimator  are  discussed  in  section  2.4), 
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then  at  the  mth  iteration  we  update  6k  as  follows: 


Okm+1  =  Okm-VmJ(Okm),  (18) 

where  {r]m}  is  an  appropriately  chosen  sequence  of  learning  rates.  According 
to  [68]  (page  106),  the  aforementioned  scheme  converges  almost  surely  to  the 
root  6k  of  .7(0): 

J(0fc)  =  0, 

if  the  following  four  conditions  are  satisfied: 


Cl)  (Learning  Rates):  r]m  >  0,  r)m  ->•  0,  £”=0 rjm  =  oo  and  E“=0i  <  oo. 

C2)  (Search  Direction):  For  some  symmetric,  positive  definite  matrix  B  and 
every  0  <  e  <  1, 


inf  (ek  -  ek)T  BJ(Gk)  >  0. 

e<||0fe-0fc||<l/e  V  ' 

C3)  (Mean-zero  noise):  The  estimator  J(6k)  of  J(6k )  is  unbiased  i.e.: 


E 


J(0fc), 


(19) 


C4)  (Growth  and  variance  bounds): 

<  c(i+  ||  e  ||2) .  (20) 

In  this  work  we  use  sequences  of  the  form: 

=  71 

^ m  (m  +  mo)-01' 

with  1/2  <  ck  <  1,?7  >  0  and  mo  >  0,  which  ensures  that  condition  (1)  is 
satisfied.  In  all  numerical  examples  of  Section  4  we  used  m0  =  0.  In  Appendix 
A,  we  prove  that  condition  C2  holds  in  our  case  for  B  =  I  i.e.  the  identity  ma¬ 
trix.  Condition  C3  is  a  trivial  property  of  the  Sequential  Monte  Carlo  sampler 


J(0) 
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as  discussed  in  Section  2.4  [19].  Finally,  the  fourth  condition  C4  (for  bounded 
kernel  functions  Kj  (Equation  (15)))  is  satisfied  for  Sequential  Monte  Carlo 
estimators  under  mild  condition  as  it  has  been  shown  in  [15]  (Theorem  1)  and 
[40]  (Theorem  4).  More  recent  works  have  increased  the  generality  of  these  re¬ 
sults  [12,21]  and  provided  asymptotic  rates  for  the  variance  of  the  estimators 
as  well. 


In  practical  terms,  Equation  (18)  implies  computing  a  weighted  average  of 
the  gradient’s  estimates  at  the  current  and  previous  iterations.  By  employing 
a  decreasing  sequence  of  weights,  information  from  the  earlier  iterations  gets 
discarded  gradually  and  more  emphasis  is  placed  on  the  recent  iterations.  In 
practice  the  gradient  descent  was  terminated  when  the  following  two  condi¬ 
tions  were  met.  Firstly,  when  all  all  the  components  of  the  gradient  J(0)  had 
crossed  zero  at  least  once.  Given  the  problem  convexity,  this  indicated  that 
further  flucluations  were  the  result  of  noise  in  the  Monte  Carlo  estimators.  Sec¬ 
ondly,  when  the  relative  change  in  6  was  smaller  than  a  prescribed  tolerance 


i.e. 


\\ak  _ ak  || 

l|Pm  +  l  "mil  ^  10_ ^ 


\\0L 


The  final  number  of  iterations  depends  also  on  the  number  of  kernels  present 
in  the  approximation. 


2.3  Kernel  Selection 

A  critical  objective  in  the  proposed  framework  relates  to  the  sparseness  of  the 
free  energy  approximation  i.e.  the  cardinality  k  of  the  expansion  in  Equation 
(7).  This  is  important  in  at  least  two  ways.  Firstly,  because  sparser  representa¬ 
tions  can  more  clearly  expose  salient  features  of  the  free  energy  landscape,  and 
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as  a  consequence,  of  the  atomistic  ensemble  considered.  Secondly,  because  they 
reduce  the  number  of  parameters  6  with  respect  to  which  the  optimization 
problem  needs  to  be  solved  (section  2.2).  Given  a  vocabulary  of  potentially 
overcomplete  basis  functions  and  a  prescribed  k,  the  problem  amounts  to  iden¬ 
tifying  those  kernels  (Equation  (7))  that  best  approximate  the  true  free  energy 
surface  i.e.  minimize  the  KL  divergence  for  2  e  D  (Equation  (9)).  Given  that 
the  objective  function  1(6)  implicitly  depends  on  the  kernels  selected,  the 
aforementioned  problem  is  equivalent  to  finding,  amongst  all  /c-sized  combi¬ 
nations  of  kernels,  the  one  that  gives  rise  to  the  smallest  1(G).  This  obviously 
implies  an  excessive  computational  effort  since  the  cumbersome  optimization 
problem  with  respect  to  G  would  have  to  be  solved  for  all  possible  k— sized 
combinations  of  basis  functions. 


For  that  purpose,  we  propose  a  suboptimal  scheme  that  proceeds  by  adding 
a  single  kernel  at  the  end  of  each  optimization  cycle  with  respect  to  6  i.e. 
the  cardinality  k  of  the  expansion  (Equation  (7))  increases  by  one.  Simi¬ 
lar  greedy  procedures  have  been  successfully  applied  in  maximum  entropy 
problems  [79,22,80].  Without  loss  of  generality,  one  can  consider  a  vocab¬ 
ulary  of  functions  that  consists  of  the  Gaussian  kernels  discussed  in  Equa¬ 
tion  (8)  which  are  parametrized  by  their  locations  z j  G  T>  and  bandwidths 
Tj  =  {tj.i  G  M+}f=1.  Given  k  such  kernels,  let  Gk  =  {0k}k=l  (Equation  (7)) 
denote  the  corresponding  parameters  that  minimize  1(6)  in  Equation  (11). 
The  goal  of  the  ensuing  scheme  is  to  select  G  V,  tu+ 1  =  {Tk+1,1  G  M+}f=1 
of  the  next  k  +  1  kernel.  Once  this  have  been  achieved,  the  corresponding 
Gk  =  {6k+l  values  are  obtained  by  carrying  out  the  optimization  process 
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discussed  in  the  previous  section  for  the  given  set  of  k  +  1  kernels.  Let  also: 


Ak{z-  0k)  =  Ef=i  OkKj{ z)  and  Ak+1(z-  0fc+1)  =  E*U  ^+1#;(z)  (21) 

denote  the  associated  free-energy  approximations  obtained  using  k  and  k  +  1 
kernels  respectively,  and  pk( q,  z  |  0fc)  and  (q,  z  |  0A+1)  the  corresponding 
densities  in  Equation  (3).  Note  that  in  general  8k  A  0k+1,  3  —  1,  ■  •  • ,  k  even 
though  the  first  k  kernels  in  the  aforementioned  expansions  (Equation  (21)) 
are  identical.  Since  0k  minimize  1(0),  the  gradient  of  1(6)  must  be  zero  at  6k , 
i.e.: 


dl(0) 

dO 


ek—  0  — y  E-X  [Kj(z)\ 


Ek  [K,( z)]  . 


(22) 


Given  that  Ak+i(z-,  6k+1)  provides  a  better  approximation  to  the  true  free 
energy  (or  at  least  just  as  good  as  Ak(z-,Ok)),  the  improvement  in  terms  of 
Kullback-Leibler  divergence  (Equation  (9)),  denoted  by  A^+i  can  be  assessed 
with: 


Afc+i  =  KL(tt  ||  pk)  -  KL(tt  ||  pk+i) 


(23) 


=  ?En  [Ak+1(z-,  0k+1)  -  Ak(z-  0k) \  -  log 


By  employing  Equations  (21)  and  (22),  it  can  be  shown  that  [79] 


Afc+1  =  f3En  \Ak+l(z-  6k+l)  -  Ak(z ;  0k)  1  -  log 


Z(0k) 


PErw  \At+1(z:  0t+‘)l  -  fiEn  \At(z ;  9 ‘)j  -  log  (24) 


=  KL(pk+ 1  ||  pk)  >  0. 


Formally  one  would  need  to  solve  an  optimization  problem  with  respect  to 
0k+1  for  all  possible  Kk+ i(z)  in  order  to  End  the  one  that  maximizes  A^+i  or 
equivalently  the  gain  in  terms  of  the  KL-divergence.  In  order  to  circumvent 
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this  difficulty,  we  employ  a  second-order  Taylor  expansion  of  Afc+1  detailed  in 

[79] : 

Afc+i  ~  — —  (En  [Kk+ 1  (z)]  —  EPk  [Kk+ i(z)])  ,  (25) 

*  var^fc+i 

where  Varfc  fc+1  is  the  conditional  variance  of  Kk+1( z)  given  Kj( z)  j  =  1, . . . ,  k 
with  respect  to  a  distribution  that  lies  between  pk  and  pk+\  in  the  sense  of  Kull- 
back  (page  48,  [38]).  We  propose  therefore  measuring  this  KL-gain  using  the 
difference  between  the  expected  values  of  Kk+i( z)  in  terms  of  the  (target)  uni¬ 
form  distribution  tt  and  the  current  approximation  pk.  This  effectively  suggests 
augmenting  our  expansion  with  the  kernel  that  locally  maximizes  the  gradient 
of  1(0).  Intuitively  it  implies  incorporating  the  kernel  function  whose  expected 
value  with  respect  to  the  target,  uniform  distribution  is  worst  approximated 
by  the  current  density  pk.  In  practical  terms,  and  given  the  parametrization 
of  the  Gaussian  kernels  employed,  this  amounts  to  finding  the  location  and 
bandwidth  parameters  (z£+1,r£+1  =  {'b*+i  }f=i)  (over  the  range  of  allowable 
values)  so  that: 

(zfe+i,  rfc+1)  =  arg  max  ( EPk  [K( z;  Zfc+1  ?  )]  [/^(z,  Zk-\-l  5  )  . 

(zfc+i>Tfe+i) 

(26) 

The  solution  of  this  simple,  low-dimensional  optimization  problem  can  be 
carried  out  using  any  standard  global/local  technique.  More  details  on  the 
methodology  adopted  in  the  examples  examined  are  contained  in  section  4.  It 
is  noted  that  the  expectation  with  respect  to  pk  in  Equation  (26)  is  approx¬ 
imated  using  the  Sequential  Monte  Carlo  samplers  discussed  in  section  2.4. 
Furthermore  regularization  terms  can  be  added  to  the  objective  function  of 
Equation  (26)  in  order  to  promote  lower  or  larger  bandwidth  kernels  (see  also 
appendix  B).  Naturally,  the  same  formulation  can  be  applied  with  any  type  of 
kernel  or  overcomplete  basis  employed  (e.g.  wavelets).  The  proposed  strategy 


16 


promotes  sparseness  and  computational  efficiency  while  offering  a  progressive 
resolution  of  the  free  energy  landscape  that  naturally  involves  kernels  that 
carry  most  of  the  information  in  the  first  steps  and  successive  unveiling  of 
finer  details  (see  the  first  example  of  section  4). 

It  is  finally  noted  that  once  the  next  kernel  has  been  selected  and  the  optimiza¬ 
tion  has  been  carried  out,  the  KL-gain  A^+i  (Equation  (23)),  offers  a  natural 
metric  for  monitoring  convergence.  The  expectation  with  respect  to  the  uni¬ 
form  can  in  general  be  calculated  analytically  whereas  the  ratio  of  normalizing 
constants  log  ^  (Equation  (3))  is  a  direct  output  of  the  Sequential  Monte 
Carlo  sampling  that  is  used  to  sample  from  the  augmented  densities  and  is 
discussed  in  the  next  section. 


2-4  Adaptive  Sequential  Monte  Carlo 

The  learning  scheme  proposed  relies  on  efficient  computations  of  the  gradient 
appearing  in  Equation  (15).  This  depends  on  expectations  with  respect  to 
p(z  |  6)  (Equation  (5))  which  are  not  available  analytically  since  the  actual 
free  energy  A(z)  is  unknown.  We  resort  to  a  Sequential  Monte  Carlo  (SMC) 
scheme  that  draws  samples  from  the  joint  density  p(q,  z  j  6)  in  Equation 
(3)  which  involves  the  atomic  degrees  of  freedom  q.  It  is  noted  however  that 
convergence  of  the  stochatic  approximation  algorithm  discussed  previously 
is  guaranteed  even  with  the  most  basic  MCMC  sampler.  It  is  nevertheless 
important  to  have  a  sampling  scheme  that  mixes  well  and  reduces  the  bias  in 
the  learning.  In  addition,  the  SMC  schemes  proposed  readily  enable  sampling 
from  a  sequence  of  distributions  that  is  advantageous  in  obtaining  several 
free  energy  landscapes  (i.e.  parametrized  by  the  temperature)  as  illustrated  in 
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section  3.2. 


SMC  samplers  [20,18]  represent  a  parallelizable  strategy  that  combine  the 
advantages  of  MCMC  and  Importance  Sampling,  resulting  in  lower  variance 
estimators  [49,34,33,70].  We  propose  novel  extensions  that  allow  the  algorithm 
to  automatically  adapt  to  the  difficulties  of  the  target  density,  while  retaining 
the  ability  to  interact  seamlessly  with  legacy,  molecular  dynamics  simulators. 

The  proposed  SMC  schemes  offer  a  ffexible  framework  for  sampling  from  a  se¬ 
quence  of  unormalized  probability  distributions  and  are  therefore  highly  suited 
for  the  dynamic  setting  of  the  problem  at  hand  where  the  target  density 
p( q,  z  |  6)  changes  with  6.  For  a  given  0,  they  approximate  p( q,  z  |  6)  with 
a  set  of  N  random  samples  (or  particles /replicas)  {qW,  which  are  up¬ 

dated  using  a  combination  of  importance  sampling ,  resampling  and  MCMC- 
based  rejuvenation  mechanisms  [19].  Each  of  these  particles/replicas  is  asso¬ 
ciated  with  an  importance  weight  w®.  The  weights  are  updated  sequentially 
along  with  the  particle/replica  locations  in  order  to  provide  a  particulate  ap¬ 
proximation: 

N 

P( q,z  \0)  VFW  <Jq(i)(q)<5a(o(z),  (27) 

2=1 

where  W W  =  w  (0/e£  ,  w(j'>  are  the  normalized  weights  and  <5Z(;)(-)  is  the 
Dirac  function  centered  at  zW  These  particles/replicas  and  weights  can  be 
used  to  estimate  expectations  of  any  p(q,  z  |  0)-integrable  function  which 
converge  almost  surely  as  N  — y  oo  [18,12],  In  particular  for  Equation  (15): 

N 

W(i)  Kj(z®) 

2=1 

The  proposed  SMC  algorithms  will  be  used  iteratively,  after  each  step  of 
the  gradient  descent  algorithm.  Given  two  successive  estimates  61/  and  6/+ , 


J  Kj( z)  p(q,  z  |  6)  dqdz  =  Ep{ z|0)  [Kj( z)]  .  (28) 
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(Equation  (18))  and  a  particulate  approximation  of  p( q,  z  |  0^J,  the  goal  is 
to  obtain  new  samples  from  p( q,  z  |  6^+1)  (Algorithm  2)  and  compute  the 
new  expectations  in  Equation  (15)  based  on  Equation  (28).  The  quality  of 
the  Monte  Carlo  estimates  in  Equation  (28)  depends  on  the  proximity  of  the 
distributions  p(q,  z  |  and  p(q,  z  |  0^+1).  We  propose  building  a  path 
of  intermediate,  unormalized  distributions  that  will  bridge  this  gap  based  on 
Equation  (3)  4 : 


vr7(q,  z)  =  p( q,  z  |  (1  —  +  7C+1) 

=  exp  {-£  (V (q,  z)  -  i(z;  07))  }  ,  7  G  [0, 1], 


(29) 


where 

«7  =  (l-7)0‘  +  7<+i-  (30) 

Clearly  for  7  =  0  one  recovers  p(q,  z  |  6^n)  and  for  7  =  1,  p( q,  z  |  0^+1). 
The  role  of  these  auxiliary  distributions  is  to  provide  a  smooth  transition 
path  where  importance  sampling  can  be  efficiently  applied.  Naturally,  the 
more  intermediate  distributions  are  considered  along  this  path,  the  higher  the 
accuracy  of  the  final  estimates,  but  also  the  higher  the  computational  cost. 
On  the  other  hand  too  few  intermediate  distributions  can  adversely  affect 
the  overall  accuracy  of  the  approximation. 


To  that  end  we  propose  an  adaptive  SMC  scheme  that  automatically  deter¬ 
mines  the  number  of  intermediate  distributions  needed  [19,37].  In  this  process 
we  are  guided  by  the  Effective  Sample  Size  (ESS,  [49]).  In  particular,  let  S 
be  the  total  number  of  intermediate  distributions  (which  is  unknown  a  pri¬ 
ori)  and  7S,  s  =  1,2,...,  S  the  associated  bridging  parameters  such  that 
4  subscripts  k  and  m  indicating  the  number  of  kernels  and  optimization  iterations 
respectively  have  been  dropped 
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0  =  7i  <72  <  •  •  •  <7 s  =  1;  which  are  also  unknown  a  priori.  Let  also 
{(qW,  z^),  W^}f=1  denote  the  particulate  approximation  of  7 r7s  dehned  as  in 
Equation  (29)  for  7  =  7S.  The  Effective  Sample  Size  of  these  particles/replicas 
is  then  dehned  as  ESSs  =  1/  (W^)2  and  provides  a  measure  of  the  popu¬ 

lation  variance.  One  extreme,  i.e.  when  ESSs  =  1,  arises  when  a  single  replica 
has  a  unit  normalized  weight  whereas  the  rest  have  zero  weights  and  as  a 
result  provide  no  information.  The  other  extreme,  i.e.  ESSs  =  N,  arises  when 
all  the  replicas  are  equally  informative  and  have  equal  weights  W®  =  1/iV. 

If  the  next  bridging  distribution  7t7s+1  is  very  similar  to  7t7s  (ie.  ys+1  ~  ys), 
then  ESSs+i  should  not  be  that  much  different  from  ESSs.  On  the  other  hand 
if  that  difference  is  pronounced  then  ESSs+i  could  drop  dramatically.  Hence 
in  determining  the  next  auxiliary  distribution,  we  define  an  acceptable  reduc¬ 
tion  in  the  ESS,  i.e.  ESSs+i  >  (  ESSs  (where  (  <  1  5)  and  prescribe  7s+i 
(Equation  (29))  accordingly. 

The  proposed  adaptive  SMC  algorithm  is  summarized  in  Algorithm  1.  The  re¬ 
sampling  component  was  carried  out  using  a  multinomial  resampling  scheme 
(i.e.  the  new  population  consisted  of  replicas  drawn  from  the  previous  pop¬ 
ulation  with  probability  proportional  to  their  weights)  and  was  triggered  for 
ESSmin  =  N/2.  It  should  be  noted  that  unlike  MCMC  schemes,  the  parti- 
cle/replica  perturbations  in  the  Rejuvenation  step  do  not  require  that  the 
Ps{-,  •)  is  ergodic  [20].  It  suffices  that  it  is  a  7T7s-invariant  kernel,  which  read¬ 
ily  allows  adaptively  changing  its  parameters  in  order  to  achieve  better  mix¬ 
ing  rates.  In  the  examples  presented  a  component-wise  Metropolis-Hastings 
scheme  ([5])  was  used  to  update  q  and  z  separately  by  employing  a  Metropolis- 

5  The  value  £  =  0.95  was  used  throughout  this  study. 
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Algorithm  1  Adaptive  SMC 

Require:  s  =  1  and  71  =  0  and  a  population  {(q^,z^),  which 

approximate  7r7l  =  p(q,  z  |  in  Equation  (29). 

Ensure:  The  population  {(0W,  zW),  provides  a  particulate  approxi¬ 

mation  of  7 r7s  in  the  sense  of  Equations  (27),  (28). 

while  7S  <  1  do 

S  i —  s  T  1 

{Reweighting-Importance  Sampling} 

Let 


(j)  eXP  {  ~P(Y  (<?)- 1  .*s-l 1  i07s  ) } 

exp|-^(^(<7^l,^2i)-d(2(7i;07s-i)} 

=  d7  exP  {-^(^(*£1);  (7.  -  7*-i)(®m+i  —  ®£>)} . 


(31) 


be  the  updated  weights  as  a  function  of  7S.  Determine  6  (7s-i,  1]  so 
that  ESSs  =  C  ESSs-i  . 

{Resampling} 

if  ESSs  <  ESSmin  then 
Resample 

end  if 

{Rejuvenation} 

Use  an  MCMC  kernel  Ps  [{q^l  1,  z'fd \ ) ,  (q^\  2^))  that  leaves  7t7s  invariant 
to  perturb  each  replica  (q^,  )  — »  (q}*\z^). 

end  while 

Adjusted  Langevin  Algorithm  (MALA)  for  each  set  of  coordinates  [63].  The 
Ps(-,-)  is  therefore  defined  implicitely  by  the  proposal  density  and  the  ac¬ 
cept/reject  step.  Given  (q*l\,  A-i) ,  the  proposals  consist  of: 
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•  Updating  — >  gW; 


Vq  log  'Kls(q(s-i,z; 


•s-l 


)  +  Y ^tqrq 


(32) 


)  +  \J  Atqr, 


Q- 


•  Updating  z^'l l  — *  z®: 


(33) 


where  rq  and  rz  are  i.i.d  standard  Gaussian  vectors.  A  Metropolis-Hastings 
accept/reject  step  with  respect  to  the  target  invariant  density  7t7s(-)  was  per¬ 
formed  after  each  update  which  ensures  7r7s -invariance.  Two  different  time 
steps  were  used  Atq  and  A tz  for  the  q  and  z  coordinates  respectively.  Their 
values  were  adjusted  after  each  iteration  s  so  as  to  retain  an  average  accep¬ 
tance  ratio  (over  all  replicas  N )  between  50%  and  80%  [64],  simply  by  in¬ 
creasing/decreasing  the  current  time  step  using  a  multiplication  factor6  .  The 
variable  time  step  ensured  better  mixing  at  the  Rejuvenation  step  and  con¬ 
tributed  to  the  adaptivity  of  the  algorithm.  The  theoretical  requirements  are 
satisfied  whether  one  or  more  MALA  time  steps  are  performed  in  Equations 
(32)  and  (33).  Naturally  other  molecular  dynamics  samplers  can  be  employed 
which  could  potentially  exhibit  better  mixing  or  fit  more  closely  to  the  physics 
of  the  problem  at  hand  [8]. 

It  is  also  noted  that  the  approximation  of  the  free  energy  A(z;  0),  biases  the 
potential  of  p( q,  z  |  6)  (Equation  (3))  and  allows  the  system  to  overcome  free 
6  The  multiplication  factors  we  used  in  the  numerical  examples  where  1.2  for  in¬ 
crease  and  0.7  for  decrease. 
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energy  barriers  [53].  Finally  we  note  that  the  estimates  of  the  ratio  of  nor¬ 
malization  constants  Zs/Zs_ \  between  two  successive  unnormalized  densities 
7t7s_1  and  7r7scan  be  obtained  by  averaging  the  unnormalized  updated  weights 
in  Equation  (31)  as  a  direct  consequence  of  the  importance  sampling  identity: 


Zs-l 


I  K-ls  (q.z)  dqdz, 

J  dcldz 


-  r  7r7s(q,z)  7^-1  G'2)  j  j 

-  J  TT^Cq.z)  Zs_!  (i^aZ 


y^iV  tt/W  (*3.1—1  >zs  —  1 ) 

Zi=l  VVS- 1  ,  (i)  (i)  y 

n~ts- 1  '.Cls-l’Zs-li 


(34) 


These  estimators  can  be  telescopically  multiplied  ([20,36])  in  order  to  com¬ 
pute  the  ratio  of  normalization  constants  between  any  pair  of  distributions  as 
required  in  Equation  (23). 


Given  the  preceding  discussion  in  sections  2.2,  2.3  and  2.4,  we  summarize 
below  the  basic  steps  in  the  proposed  free  energy  computation  scheme:  In  the 
inner  loop  and  for  fixed  /c,  gradient  descent  (Section  2.2)  is  performed  which 
makes  use  of  the  adaptive  SMC  scheme  (Section  2.4)  in  order  to  compute 
the  expectations  in  the  gradient.  In  the  outer  loop,  the  cardinality  of  the 
expansion  k  is  increased  by  adding  one  kernel  (i.e.  k  4—  h  +  1)  based  on 
Equation  (26).  This  is  terminated  when  the  KL  gain  (Equation  (23))  does  not 
exceed  a  prescribed  tolerance.  Algorithm  2  summarizes  formally  the  procedure. 


Remark:  As  mentioned  in  section  2.2  a  sufficient  condition  for  the  conver¬ 
gence  of  the  proposed  stochastic  approximation  scheme  is  the  unbiasedness  of 
the  gradient  estimators.  It  is  noted  that  theoretical  and  numerical  results  sug¬ 
gest  that  this  condition  is  more  stringent  than  necessary  and  can  be  relaxed 
[50,58,78,1,10].  While  the  unbiasedness  of  SMC  particulate  approximations  is 
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Algorithm  2  Calculation  of  the  free  energy  at  a  given  temperature. 
Require:  k  =  0,  0°  =  0  and  a  particulate  approximation  of  p( q,  z  |  0°) 

(Equation  (3))  at  the  desired  temperature  (3  (see  Remark  below). 

while  true  do 

Calculate  A*,  based  on  Equation  (23). 

if  Afc  <  tol  then 
Break  the  loop. 

else 

Add  the  optimal  (k  +  l)th  kernel  based  on  Equation  (26)  and  set  k  t— 
k  +  1. 

repeat 

Estimate  gradient  at  6kn  and  calculate  update  Gkn+ ,  based  on  Equa¬ 
tion  (18). 

Use  adaptive  SMC  (section  2.4)  to  construct  particulate  approxima¬ 
tion  of  p{ q,  z  |  0* +1)  from  p( q,  z  |  0*  ). 
until  Convergence  criteria  are  met. 

end  if 
end  while 

ensured  by  the  importance  sampling  step,  it  relies  on  an  unbiased  estimator  of 
the  initial  distribution  p(q,  z  |  0°).  In  order  to  achieve  this  we  considered  two 
schemes,  one  approximate  (but  nearly  exact)  and  one  exact.  The  first  involved 
running  a  long  MCMC  at  a  very  high  temperature  which  ensured  good  mixing 
(the  distribution  of  the  reaction  coordinates  was  effectively  uniform).  We  sub¬ 
sampled  the  chain  in  order  to  make  sure  that  the  samples  drawn  were  close  to 
independent  and  assigned  equal  weights.  Subsequently  the  SMC  scheme  was 
employed  to  adaptively  reduce  the  temperature.  In  physical  problems  where 
the  calculation  of  the  temperature-dependent  free  energy  surface  was  of  inter- 
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est  (see  section  3.2),  this  did  not  impose  any  additional  burden.  Furthermore, 
and  as  long  as,  the  first  temperature  was  high  enough  the  error  introduced  in 
terms  of  bias  was  negligible  as  the  MCMC  chain  for  all  practical  purposes  had 
attained  equilibrium.  The  second  menthod  which  is  exact,  relied  on  running 
MCMC  for  a  few  steps  (at  the  target  temperature)  and  estimating  the  mean 
and  covariance  of  the  atomistic  coordinates  q  (and  reaction  coordinates  2  is 
the  alchemical  case).  We  subsequntly  performed  importance  sampling  using 
as  an  importance  sampling  density  multivariate  Gaussians  with  the  aforemen¬ 
tioned  moments  and  calculated  weights  based  on  their  ratio  with  the  density 
p(q,  z\Gq).  This  ensured  an  unbiased  estimator  for  p(q,  z\Qq)  albeit  a  very  poor 
one  (with  large  variance)  as  the  importance  sampling  distribution  was  in  gen¬ 
eral  a  poor  approximation  of  p(q,z\6o).  Despite  the  different  initializations 
both  methods  converged  in  the  sample  problems  examined  which  is  a  testa¬ 
ment  to  the  power  of  the  stochastic  approximation.  We  also  performed  runs 
where  particles  were  generated  by  running  MCMC  with  p(q,  z\d$)  as  the  tar¬ 
get.  Despite  the  bias  in  the  estimator  (resulting  from  lack  of  equilibrium)  the 
method  was  still  able  to  converge  which  coincides  with  the  results  mentioned 
above. 


3  Extensions 

The  current  section  is  devoted  to  extensions  of  the  proposed  algorithmic  envi¬ 
ronment.  In  particular  we  discuss  the  reaction  coordinate  case  that  generalizes 
the  applicability  of  the  proposed  method  (Section  3.1).  Section  3.2  is  devoted 
to  the  calculation  of  the  free  energy  landscape  as  a  function  of  the  temperature 
where  it  is  shown  that  the  sequential  nature  of  the  proposed  methodology  can 
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lead  to  significant  computational  advantages. 


3.1  The  reaction  coordinate  case 

The  proposed  method  was  described  for  the  alchemical  case.  However,  it  is 
straightforwardly  generalized  to  cover  also  the  general  reaction  coordinate 
case.  Let  $,  :  M.  — >  T>  be  a  function  of  the  system  coordinates  q.  This  function 
is  called  a  reaction  coordinate  [45].  It  is  evident  that  q  can  be  viewed  in  a 
probabilistic  framework  as  a  random  variable  and  as  a  result: 

z  =  £(<?),  (35) 

is  also  a  random  variable.  The  probability  distribution  of  z  can  be  found  by 
integrating  out  q: 

P(z  |  P)  =  I  p(q)5{£(q)  -  z)dq  oc  J  exp  (~pV(q))  S{£{q)  -  z)dq.  (36) 

The  free  energy  A( z)  with  respect  to  the  reaction  coordinate  £(qr)  is  defined 

to  be  the  effective  potential  of  z  =  £(qr),  i.e. : 

p(z)  oc  exp  (—(3A( z)) .  (37) 

Combining  these  two  equations  we  see  that: 

H(z)  =  — /T 1  log  /  exp  (-/3V(q))  5(S(q)  -  z )dq.  (38) 

If  A(z;  6)  is  an  estimate  of  A( z),  we  define  a  new  probability  distribution  over 
q  as: 

p(q\0)  oc  lD(£(g))exp  (~P{V(q)  -  A{£(qf6)))  .  (39) 

It  is  straightforward  to  see  that  under  this  new  distribution  for  q,  the  pdf  of 
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z  becomes: 


P(z|0)  =  j  p{q\0)5(^{q)  -  z )dq  oc  lv(z)  exp  {-/3(A( z)  -  i(z;  0))}  .  (40) 


This  coincides  with  the  expression  in  Equation  (5)  and  therefore  the  ensuing 
derivations  hold  identically.  From  a  practical  point  of  view,  sampling  need 
only  be  performed  in  the  q  space  and  therefore  the  adaptive  SMC  schemes  are 


employed  to  obtain  particulate  approximations  of  the  density  in  Equation  (39). 


The  only  difference  appears  in  the  MCMC-based  Rejuvenation  step  where  the 
MALA  sampler  is  employed  only  with  regards  to  q.  In  particular  the  update 
of  Equation  (32)  now  becomes: 


(41) 


It  is  noted  that,  in  contrast  to  some  ABF  methods  which  require  second-order 
derivatives  of  £  [29],  the  proposed  technique  only  needs  first-order  derivatives. 
Finally,  we  point  out  that  the  ability  of  the  proposed  approach  to  provide 
efficiently  estimates  of  parametrized  free  energy  surfaces  (as  in  section  3.2 
with  respect  to  the  temperature  /3),  can  also  be  exploited  in  the  reaction 
coordinate  case  by  defining  a  joint  density: 


where  as  in  [51]  an  artificial  spring  with  stiffness  //  has  been  added.  Clearly  for 
H  — y  oo  one  recovers  the  aforementioned  description,  but  for  all  other  values 
of  p  the  formulation  reduces  to  that  of  Equation  (3)  where  in  place  of  V (q,  z) 
we  now  have  R(q)  +  ^  ||  z  —  £(q)  ||2.  One  can  therefore  obtain  free  energy 
surfaces  for  various  /i  values.  For  smaller  /i  the  free  energy  would  be  flatter 
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and  in  the  extreme  case  of  /i  =  0  it  would  be  constant.  As  /i  increases,  the 
complexities  of  the  free  energy  surface  would  become  pronounced.  Hence  by 
exploiting  the  idea  of  section  3.2,  a  sequence  of  problems  parametrized  by  /a 
rather  than  (3,  can  be  constructed  to  gradually  move  to  larger  /i  values  by 
using  the  free  energy  of  the  previous  /i  as  an  initial  guess  for  the  new  one. 
The  adaptive  SMC  scheme  would  ensure  a  smooth  enough  transition  while 
retaining  a  good  level  of  accuracy  for  the  approximations  obtained. 


3.2  Obtaining  the  free  energy  landscape  for  various  temperatures. 


The  methodology  described  in  the  previous  sections  is  suitable  for  calculating 
the  free  energy  as  a  function  of  z  at  a  given  temperature.  However,  one  is  often 
interested  in  the  temperature  dependence  of  the  free  energy  landscape.  In  order 
to  achieve  this  goal  we  make  use  of  the  following  two  facts.  Firstly,  the  free 
energy  landscape  at  higher  temperatures  is  flatter  and  secondly  that  nearby 
temperatures  have  similar  free  energy  landscapes.  Based  on  these,  we  propose  a 
natural  extension  to  the  sequential  sampling  framework  of  subsection  2.4  that 
can  efficiently  produce  estimates  of  the  free  energy  at  various  temperatures. 
The  idea  is  to  start  from  a  higher  temperature,  compute  the  free  energy  as 
described  before,  then  gradually  move  towards  lower  temperatures  using  the 
free  energy  of  the  previous  temperature  as  an  initial  guess  for  the  new  one. 
In  particular  given  the  free  energy  estimate  6((3 1))  and  the  particulate 

approximation  of  the  joint  density  p^(q,  z|0(/31))  at  a  temperature  1  / /?i ,  we 
propose  employing  the  aforementioned  adaptive  SMC  in  order  to  obtain  a 
particulate  approximation  of  the  following  joint  density  at  (32  >  (3\  (i.e.  for 


lower  temperature) 


P/?2(q>z  I  0(/3i))  oc  exp{-/32  (V(q,z)  -  APl(z\  0(/3i)))}  .  (43) 

The  iterations  enumerated  in  Algorithm  2  can  then  be  carried  out  in  the 
same  fashion  by  updating  the  existing  9  as  well  as  adding  new  kernels  if  the 
convergence  criteria  are  not  satisfied. 


The  critical  step  involves  building  a  sequence  of  distributions  from  (q,  z| 6 (Pi)) 
to  pp2( q,  z  |  0(f3 1))  in  Equation  (43).  For  this  purpose  and  similarly  to  a  sim¬ 
ulated  annealing  schedule  we  employ 

vr7(q,z)  oc  exp{-((l  -7)/3i  +  7/%)  (V( q,  z)  -APl{z]0(P i)))}  .  (44) 

The  steps  in  Algorithm  1  should  be  adjusted  to  the  aforementioned  sequence 
of  bridging  distributions  with  the  most  striking  difference  in  the  Reweighing 
step  where  the  updated  weights  in  Equation  (31)  should  now  be  given  by 


(i)/  \  (i)  7r-ys(qi-i>zs-i) 

wr(7s)  =  <~i  - — f;(o  ;(o , 


d-i  exp  {-(7,  -  A)  A(qi-i,4-i)  -  Ai (*;0(/8i 


(45) 


We  demonstrate  the  efficacy  of  such  an  approach  in  the  last  example  of  section 
4.  It  is  finally  noted  that  at  the  beginning  of  iterations  at  each  new  tempera¬ 
ture,  kernels  with  very  small  weights  8j  were  removed  if  m  jg  |  <  0.01. 


4  Numerical  Examples 

In  the  ensuing  numerical  examples  the  following  parameter  values  were  used: 
(1)  SMC:  ESSmin  =  N/2,  C  =  0.95. 
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(2)  Time-step  adaptation:  We  used  the  multiplication  factors  a*  =  1.2  to 
increase,  and  aa  —  0.7  to  decrease  the  time-step  so  that  the  acceptance 
ratio  remained  between  50%  and  80%. 

(3)  Stochastic  gradient  descent:  m0  =  0. 

The  rest  of  the  parameter  values  are  discussed  in  each  particular  problem. 

The  algorithm  we  used  for  the  solution  of  the  kernel  addition  optimization 
problem  defined  in  Equation  (26)  was  the  simplex  method  for  multidimen¬ 
sional  minimization  [60].  The  centers  z j  of  the  kernels  were  restricted  within 
the  domain  V,  while  the  bandwidths  tj#  were  allowed  to  have  values  within 
0.01diam(T>)  and  0.5  diam('D)  '  . 


4-1  Two-Dimensional  Toy  Example 


Consider  a  two-dimensional  system  [75,42]  with  a  single  parameter  z.  inter¬ 
acting  with  potential  energy: 

% (?;  z)  —  cos(2vrz)(l  +  diq )  +  d2q2. 

Assume  that  q  given  z  and  f3  is  distributed  according  to: 


p(q\z,P)  oc  exp  (—(3V (q;  z)) , 


where  /3  is  also  a  fixed  parameter  that  plays  the  role  of  an  inverse  temperature. 
We  wish  to  calculate  an  approximation  A(z)  of  the  free  energy  A(z)  on  an 
interval  V  =  [—0.5, 0.5].  The  true  free  energy  can  be  found  analytically  to  be 

.  .  .  ,  .  d?  cos(2vrz)2 

A(z )  =  cos(27tz) - - - h  c, 

4  do 


diam(D)  =  sup{|zi  —  Z2I  :  zi,  2.2  G  T>}  denotes  the  diameter  of  the  set  V. 
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where  c  is  a  constant  that  depends  upon  the  specific  choice  of  the  fixed  pa¬ 
rameters.  In  what  follows,  we  choose  c  so  that  A(— 0.5)  =  0. 

To  demonstrate  our  method  in  this  simple  example  we  used  d\  =  2,  d2  =  30. 
The  potential  energy  V (q\  z )  for  this  choice  of  the  parameters  is  depicted  in 
Figure  1(a).  We  fix  the  inverse  temperature  to  (3  =  10.  As  shown  in  Figure 
1(b),  the  distribution  is  bimodal  with  a  big  region  of  practically  zero  prob¬ 
ability  separating  the  two  modes.  Hence,  metastability  along  the  parameter 
z  is  apparent.  The  performance  of  the  proposed  method  with  respect  to  the 
number  of  replicas  used  in  the  adaptive  SMC  scheme  is  depicted  in  Figures  2 
and  3  which  show  the  evolution  of  the  estimated  free  energy  landscape  with 
N  =  100  and  N  =  10,  000  replicas  respectively.  In  both  cases  the  method  is 
capable  of  capturing  correctly  the  characteristics  of  the  reference  solution  and 
as  expected  the  variance  of  the  computated  solution  is  less  when  the  number 
of  replicas  is  larger.  In  both  cases  the  Robbins-Monro  learning  series  is  picked 
to  be  rjm  =  r]m~a  with  a  =  0.6  and  the  learning  rate  rj  =  0.1. 

Figure  4(a)  shows  the  first  three  kernels  selected  by  the  greedy  scheme  de¬ 
scribed  in  section  2.3.  Figure  4(b)  depicts  the  log-values  of  the  kernel  weights 
{9j}j=1  which  clearly  demonstrate  the  ability  of  the  proposed  approach  to  pro¬ 
vide  sparse  approximations.  The  first  kernel  selected  has  the  greatest  weight 
and  hence  it  contains  the  majority  of  the  information  about  the  free  energy 
curve.  The  rest  of  the  kernels  are  progressive  corrections  of  the  estimate  given 
by  the  first  kernel.  This  conclusion  is  also  supported  by  the  result  of  Figure  5 
which  shows  the  evolution  of  the  reduction  in  the  KL  divergence  with  respect 
to  the  total  number  of  iterations  as  quantified  by  adding  the  Afc+1  in  Equa¬ 
tion  (23).  Clearly  the  first  kernel  offers  the  greatest  KL  gain  (Ai)  and  further 
kernel  additions  offer  (almost  always)  progressively  smaller  reductions  in  the 
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(a)  The  potential  energy  V(qi,q2)  for  (b)  The  probability  distribution 
d\  =2 ,d2  =  30  P(qi,q2\f3)  for  d\  =  2 ,d2  =  30 ,/3  =  10 

Fig.  1.  Potential  energy  and  pdf  for  the  toy  example  of  section  4.1 

KL  divergence. 


4-2  WCA  Dimer 


We  consider  n  —  16  atoms  in  a  two-dimensional  fully  periodic  box  of  side  l 
which  interact  with  a  purely  repulsive  WCA  pair  potential  [42]: 


CwcaM 


,  if  r  >  r0 
,  otherwise 


where  r0  =  2 The  parameters  a  and  e  give  the  length  and  energy  scales 
respectively.  Two  of  these  atoms  (say  atoms  1  and  2)  are  assumed  to  form  a 
dimer  and  their  interaction  is  described  instead  with  a  double-well  potential: 

(r  —  r0  —  w )2 


Vs(r)  =  h 


1  - 


wr 


where  h ,  w  are  fixed  parameters  and  r  the  distance  between  them.  This  po¬ 
tential  has  two  equilibrium  points  ro  and  +  2w.  The  barrier  separating  the 
two  equilibria  is  h. 
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(c)  k=3  (d)  k=8 


Fig.  2.  Free  energy  profiles  for  various  kernel  numbers  k  when  using  N  =  100  replicas 
in  the  adaptive  SMC  scheme 


Let  q  =  (qi,q2,...  ,qN )  with  qt  E  M2  denoting  the  position  of  atom  i.  The 
potential  energy  of  the  system  is 

2  N 

^(9)  =  Vs(|gi-g2|)  +  5ZZ)vwcA(|gt-gj|)+  5Z  ^wcaC*  -  Qj\)- 

i= 1  j= 3  2  <i<j 

We  consider  an  NVT  ensemble  (the  volume  V  is  determined  by  the  side  of  the 
box  /).  The  probability  distribution  of  the  atomic  positions  q  is: 


p{q\P)  oc  exp  (—/3V (q)) , 

where  (3  =  ks  is  the  Boltzmann  constant  and  T  the  temperature  of  the 
system.  Under  these  assumptions  atoms  1  and  2  will  form  a  dimer  with  two 
equilibrium  lengths.  An  effective  potential  of  the  dimer  length  in  the  presence 
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(c)  k=3  (d)  k=8 


Fig.  3.  Evolution  of  free  energy  estimates  at  various  kernel  numbers  k  when  using 
N  =  10, 000  replicas  in  the  adaptive  SMC  scheme 


of  the  other  atoms  is  given  by  the  free  energy  A(r)  with  respect  to  the  reaction 
coordinate: 

*  =  £(<?)  =11  9i  -  92  |h, 
where  |  j  -  1 1 2  is  the  Euclidean  norm  of  M2. 

We  calculate  A(z)  using  our  scheme  for  two  different  box  sizes  (densities):  l  =  5 
(high  density)  and  l  =  12  (low  density).  The  parameters  are  set  to  n  =  16 
atoms,  [3  —  1,  e  =  1,  <7  =  1,  h  —  1,  w  =  0.5.  We  employed  N  =  500  replicas  and 
the  Robbins-Monroe  learning  series  is  again  rjm  =  r]m~a  with  a  =  0.501  and 
77  =  0.1.  The  resulting  free  energy  curves  at  various  stages  of  the  estimation 
process  with  increasing  number  of  kernels  are  shown  in  Figure  6.  Figure  7 
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(a)  First  three  kernels  Kj{.)  picked  by 
the  greedy  optimization  scheme  to  il¬ 
lustrate  selection  of  location  and  band¬ 
width  parameters  (N  =  10,000). 


(b)  Log  absolute  weights  (log  |  9j  |)  of 
the  first  10  kernels  added.  The  6  value 
of  the  first  kernel  is  over  one  order  of 
magnitude  larger  than  the  rest. 

10,  000  replicas 


Fig.  4.  Kernels  selected  and  kernel  weights  obtained  with  N 


reference 


Fig.  5.  Evolution  of  the  KL  divergence  gain  (Equation  (23))  at  various  k  with 


respect  to  the  total  number  of  gradient  ascent  iterations  performed.  The  dashed  line 


(reference)  depicts  the  KL-distance  between  the  initial  distribution  (6  =  0)  and  the 


target  uniform  distribution  which  can  be  calculated  analytically  for  this  example. 
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(a)  Low  Density  l  =  12.  (b)  High  Density  1  =  5. 

Fig.  6.  The  free  energy  of  the  dimer  at  two  different  densities  and  for  various  numbers 
of  kernels  k,  compared  with  Vs(r).  Notice  that  at  low  density  (a)  the  right  well 
becomes  the  most  probable.  This  situation  is  reversed  at  high  density  (b). 

compares  the  empirical  cumulative  distribution  function  of  the  replicas  with 
that  of  the  target  uniform.  Their  proximity  indicates  convergence  as  it  can 
also  be  established  by  a  Kolmogorov-Smirnov  test. 

From  a  physical  point  of  view,  we  notice  that  at  low  density  i.e.  when  the  box 
size  is  l  =  12  (Figure  6(a)),  the  equilibria  move  to  the  right  with  the  well 
closest  to  r0  +  2 w  becomes  the  most  probable.  Furthermore  the  free  energy 
barrier  is  slightly  decreased  as  compared  to  the  high  density  case  when  /  =  5 
(Figure  6(b)).  Under  these  conditions  the  equilibria  move  to  the  left  and  the 
well  closest  to  r0  becomes  the  most  probable. 
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Fig.  7.  Comparison  of  the  empirical  cumulative  distribution  function  of  the  reaction 
coordinates  of  the  replicas  with  that  of  the  target  uniform  for  1  =  5.  We  also 
performed  a  Kolmogorov-Smirnov  test  [54]  as  implemented  in  Matlab  [55].  The 
default  confidence  level  was  h  =  0.05.  The  hypothesis  was  accepted  and  the  test 
reported  an  asymptotic  p  value  of  0.3515.  The  value  of  the  test  statistic  was  0.0631. 

4-3  38-Atom  Lennard- Jones  Cluster  ( LJ 3$) 

We  consider  a  38-atom  cluster  in  3-dimensional  space  with  pairwise  interac¬ 
tions  given  by  the  Lennard-Jones  potential: 


with  e  and  a  playing  the  role  of  energy  and  length  scale  respectively.  Let  the 
Cartesian  coordinates  of  the  system  be: 

q  =  (qi,.  .  .  ,q38)  ,q*  G  M3.  (47) 

Then  the  potential  energy  of  the  system  is: 

^(q)  =  5IyLj(lq*-qil)- 

i<j 

Finally  we  assume  that  the  replicas  follow  an  NVT  distribution  of  the  form: 

PW)  oc  exp  {—fiV  (q)}  , 
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(a)  Q4  »  0.01. 


(b)  Qi  pa  0.19  (truncated  octahedron). 


Fig.  8.  Indicative  metastable  states  corresponding  to  the  two  wells  of  the  free  energy 
landscape  with  respect  to  order  parameter  Q\  (Equation  (48)). 

where  /?  =  1/ksT.  At  zero  temperature  the  system  is  known  to  have  a  global 
minimum  yielding  an  FCC  truncated  octahedron  (Figure  8(b)).  The  second 
and  third  lower  energies  give  incomplete  Mackey  icosahedra.  Furthermore 
there  is  a  big  number  of  liquid-like  local  minima  ([25,7]). 

Consider  the  family  of  order  parameters  initially  introduced  in  [69]: 


(48) 


with: 


Qlm  AT 

^ b  rtj<r o 


where  the  sum  is  over  all  the  iVj,  pairs  of  atoms  with  =  q,:  —  cp  <  ro, 
Yim(Q ,  0)  is  a  spherical  harmonic,  while  9tJ  and  <f>ij  are  the  polar  and  azimuthal 


angles  of  a  bond  vector  with  respect  to  an  arbitrary  coordinate  system.  In  [7] 


it  is  shown  that  for  l  —  5,  Q4  can  distinguish  the  FCC  structure  but  not 
the  icosahedra!  and  liquid-like  minima  (Figure  8(a)).  However,  if  one  also 
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considers  the  potential  energy  as  a  reaction  coordinate,  the  two  structures  are 
well-separated.  Hence,  we  define  the  two  dimensional  reaction  coordinate: 

f(q)  =  (Q4(q),H(q)) . 

and  compute  the  free  energy: 

A(Qi,E)  =  /3_1  j  exp{-/3U(q)}h(Q4  -  Q4(q))5(E  -  V(q))dq, 
over  the  domain: 

V  =  [0,0.2]  x  [— 175e,  —  145e], 

for  a  temperature  range  ksT  =  0.21  to  ksT  =  0.091  using  the  temper¬ 
ing  scheme  described  in  Section  3.2.  We  employ  N  =  100  replicas  and  10 
MCMC/ Rejuvenation  steps  per  replica.  At  each  (3  =  /c^T,  the  Robbins-Monro 
learning  series  was  adjusted  to  r]m  =  rjm~a  with  a  =  0.501  and  a  learning  rate 
ij  =  0.1  /  (3.  The  adaptive  SMC  scheme  automatically  determined  260  inter¬ 
mediate  steps/distributions  in  order  to  cover  the  whole  range  of  the  afore¬ 
mentioned  temperatures.  The  time  step  A tq  employed  in  the  MALA  sampler 
was  adaptively  adjusted  as  discussed  previously  and  took  values  between  10-4 
(low  temperatures)  and  7  x  10~4  (high  temperatures).  The  very  first  step,  at 
T  =  0.21  (/3  =  4.76)  required  12,000  optimization  iterations  to  converge  with 
a  cost  of  approximately  7.2  x  105  time  steps  per  replica,  ft  is  emphasized  that 
due  to  the  parallelizable  nature  of  the  SMC  scheme  employed,  each  replica 
can  be  simulated  on  a  different  CPU.  Centralized  control  is  only  required  for 
the  evaluation  of  the  ESS  (Reweighting  step)  and  during  the  Resampling  step 
which  are  computationally  less  expensive  than  the  Rejuvenation  step. 

The  sequence  of  intermediate  /Ts  determined  automatically  by  the  scheme 
discussed  in  section  3.2  is  depicted  in  Figure  9.  The  similarity  of  the  free  en- 
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ergy  surfaces  at  neighboring  temperatures  allowed  us  to  converge  with,  on 
average,  800  optimization  iterations  at  each  intermediate  f3.  The  overall  simu¬ 
lation  cost  amount  to  2.4  x  104  time  steps  per  replica.  This  was  approximately 
30  times  less  than  the  7.0  x  105  time  steps  per  replica  which  were  needed 
when  calculating  the  free  energy  solely  at  the  final  /3  =  0.091  i.e.  without 
using  the  sequence  of  temperatures  and  corresponding  free  energies.  The  sig¬ 
nificant  reduction  indicates  the  potential  computational  savings  afforded  by 
the  sequential  approach  advocated  in  this  work. 

The  free  energy  surfaces  computed  are  depicted  in  Figure  10  at  four  indicative 
temperatures.  The  number  of  kernels  selected  by  the  algorithm  varied  between 
90  and  120.  As  it  has  been  reported  in  previous  studies  [7],  we  identified  two 
metastable  states  at  Q 4  ~  0.01  which  corresponds  to  the  truncated  octahedron 
and  at  Q 4  ~  0.19  which  corresponds  to  the  icosahedron.  The  latter  becomes 
more  pronounced  at  lower  temperatures. 

I11  order  to  assess  the  quality  of  the  results  in  two  dimensions  we  also  calcu¬ 
lated  the  free  energy  profile  using  only  Q 4  as  the  reaction  coordinate  (Fig¬ 
ure  11)  and  compared  it  with  the  result  obtained  by  performing  a  numeri¬ 
cal  integration  of  the  two-dimensional  free  energy  surface  i.e.  by  computing 
A(Q4)  =  —  /3~1logf  e~^A^4,E^dE.  The  two  free-energy  curves  are  depicted  in 
Figure  11  where  good  agreement  is  observed  at  two  different  temperatures. 


5  Conclusions 

In  summary,  the  proposed  method  provides  a  unifying  framework  for  estimat¬ 
ing  the  free  energy  function  simultaneously  with  biasing  the  dynamics.  The 
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Fig.  9.  Sequence  of  intermediate  /3’s  identified  by  the  scheme  discussed  in  section 
3.2  for  the  LJ 35  cluster.  The  free  energy  landscape  was  calculated  at  each  of  these 
temperatures  by  efficiently  updating  the  free  energy  surface  at  the  previous  step. 


minimization  of  the  Kullback-Leibler  divergence  in  the  extended  space  pro¬ 
vides  rigorous  convergence  bounds  and  diagnostics.  It  requires  minimal  ad¬ 
justment  of  parameter  values  a  priori  (basically  only  the  learning  rate  A  and 
convergence  tolerances)  as  it  is  adaptive  and  automatically  promotes  sparse 
representations  of  the  free  energy  surface.  The  proposed  approach  shares  a 
common  theme  with  other  adaptive  methods  in  free-energy  estimation  and 
Monte  Carlo  methods  in  general,  in  that  the  target  distribution  (in  our  case 
p( q,  z  |  0))  is  modified  from  iteration  to  iteration  based  on  its  past  samples 
[48].  The  approximation  of  the  free  energy  A(z;0),  biases  the  potential  of 
p(q,  z  |  6)  (Equation  (3))  and  allows  the  system  to  overcome  free  energy  bar¬ 
riers  [53].  As  in  [24],  no  binning  is  needed  and  the  bias  potential  is  nonlocal, 
providing  information  about  the  free  energy  landscape  not  only  at  the  states 
visited  but  in  their  neighborhood  as  well.  It  offers  several  possibilities  for  fur¬ 
ther  improvements  by  considering  different  optimization  schemes  (e.g.  noisy 
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(c)  T  =  0.14.  (d)  T  =  0.11. 

Fig.  10.  Free  energy  contours  A(Q 4,  E)  with  respect  to  the  two  reaction  coordinates 

Qa  (x-axis)  and  E  (y-axis)  at  various  temperatures  for  LJ 38. 

conjugate  gradients)  and  employing  different  basis  functions  (e.g.  wavelets). 

Its  sequential  nature  allows  the  efficient  computation  of  a  family  of  free  en¬ 
ergy  surfaces  at  different  temperatures.  We  believe  that  these  features  make 
the  proposed  approach  suitable  to  calculate  the  free  energy  of  systems  more 
physically  challenging  than  the  ones  discussed  in  this  paper. 
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Fig.  11.  Free  energy  profiles  A(Q 4)  obtained  using  the  proposed  scheme  while  using 
only  reaction  coordinate  (dashed  lines)  and  by  integrating  the  two  dimensional  free 
energy  surface  i.e.  ^(^4)  =  —  /3_1  log  f  e~^A^i,E^dE  (solid  lines)  for  two  tempera¬ 
tures  T  =  0.19  and  T  =  0.14. 
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Appendix  A 


This  appendix  contains  a  proof  of  the  second  sufficient  condition  C2  for  the 
almost  sure  convergence  of  the  proposed  stochastic  approximation  scheme  as 
discussed  in  Section  2.2. 

For  a  fixed  number  of  kernels  to  k.  let  6k  G  ©  be  the  unique  global  minimum 
of  I(6k),  let  0  <  e  <  1  and  define  the  subset  of  ©: 

©e  =  \ek  G  ©  :  e  <||  ek  -  ek  ||<  1/e}  . 

Our  goal  is  to  prove  that: 

inf  ( ek  -  ekY  j(ok )  =  inf  (ek  -  ek)T  j(ok )  >  o.  (49) 

ek&®,  V  J  e<||0fc— 0fc||<l/e  V  > 

To  this  end,  observe  that: 


Z(»*)  =  dqch 


—  Z(6k)Ep(z\gk ) 


j(A(z-,ek)-A(z-,ok)) 


(50) 


and  as  a  result  of  Jensen’s  inequality: 


Z{6k)  ^p(z\9k) 


j(A(z-,ek)-A(z-e)) 


>  pEp[z ]gk)  A{ z;  Gk)  -  A( z;  Gk) 


(51) 


Hence,  the  difference  between  the  values  of  the  objective  function  at  the  op- 
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timum  point  6k  and  an  arbitrary  6k  G  ©e  is  non-negative  and: 


0  <  I(6k)  —  I(6k )  =  —  fD  7r(z)  logp(z|0fc)  dz  +  Jv  7r(z)  logp(z|0fc)  d z 


=  f3En(z)  A( z;  0fc)  -  A(z;  0fc)  dz  -  log  fjfjj 


<  /3  (K(z)  [i(z;  0fc)  -  i(z;  0fc) 


£ 


'p(z|0fc) 


i(z;0fc)  -  i(z;0fc) 


=  -  -  9?)/?  [Er[^){Kj(i)}  ~  ®.(.|KW] 


=  (e*  -  ek)Tj(ek) 

(52) 

where  the  last  equality  is  a  result  of  the  definition  of  the  gradient  J(6)  of  1(6) 
in  Equation  (15).  Given  that  (6k  —  6k)TJ(6k)  >  I(9k)  —  I(6k),  it  suffices  to 
prove  that: 

mf_  (1(9“)  -  1(9“))  >  0.  (53) 

We  do  this  by  contradiction.  Suppose  that: 

^iiffi  (d(0fc)  -  I(6k))  =  0  (54) 

Then,  there  exists  a  sequence  { 6 C  ©e  such  that: 

I(9kJ  -G  I(6k  ),  asm-)-  oo.  (55) 


Since  the  sequence  { 6 is  bounded,  it  has  a  convergent  subsequence  {6knn}, 
i.e 

6k  — y  6*,  as  n  — *  oo, 

for  some  6*  e  clos(©e)  8 .  From  the  continuity  of  1(6)  with  respect  to  6 ,  we 
must  also  have: 

I(9kmJ  1(6*) 

8  clos(^4)  denotes  the  closure  of  the  set  A  C  M.k  under  the  Eucledian  metric. 
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From  the  uniqueness  of  the  limit  in  Equation  (55),  we  get  that: 


I(6k)  =  1(6*) 


Since  1(6)  has  a  unique  global  minimum: 

6k  =  6*  e  clos(0e), 
which  contradicts  Equation  (54). 
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Appendix  B 


This  appendix  contains  an  alternative  interpretation  of  the  proposed  for¬ 
mulation  (section  2.1)  through  the  prism  of  the  well-known  Expectation- 
Maximization  scheme  (EM,  [23])  and  offers  some  potential  Bayesian  exten¬ 
sions. 

For  that  purpose  we  define  the  function  I' (6)  =  —1(0)  —  Jv  7r(z)  logp(z  |  0)dz 
(Equation  (12)).  Maximizing  I'(0)  is  equivalent  to  minimizing  1(0).  We  note 
that  for  all  z  and  any  density  Q(q),  q  £  JA: 

log  p(z  |  0)  =  log  JMp(q,z  |  0)  dq 
=  log  fMQ(q)^y1  dz 

>  fM  Q(q)  log  P<yq^  dz  (Jensen's  inequality ) 

=  ImQ(^)1oSP(Q,z  I  e)  dz  —  J  Q(q)  log  Q(q)  dq 
=  HQ,0;z) 

This  lower  bound  J-  holds  for  all  z  G  T>  and  depends  on  the  auxiliary  density 
Q(q)  and  the  model  parameters  0.  Furthermore  it  also  provides  a  lower  bound 
on  I'(0)\ 

I'(0)>  [  n(z)(F(Q,0-,z))  dz  (57) 

Jv 

For  any  z  G  T>  and  fixed  0,  the  optimal  density  Q  is: 

Q°*(q\0)=p(q\z,B)  (58) 

in  which  case  the  inequalities  in  Equations  (56)  and  (57)  become  equalities. 
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More  importantly  though  the  introduction  of  the  auxiliary  density  Q  suggests 
an  iterative  algorithm  for  finding  the  optimal  parameter  values  for  9.  Initial¬ 
izing  with  an  value  00,  at  each  subsequent  iteration  m  we  alternate  between 
the  following  two  steps: 

•  Expectation  step  (E-step): 

Qm(z)  =  argma xJr(Q,  0m;  z)  =  Qopt(q  \  Qm- 1)  =  p(q  \  z ,  0m_i),  Vz  e  V 

(59) 

•  Maximization  step  (M-step): 


dm  =  arg  rnaxe  fv  vr(z)  9 ;  z))  dz 

=  arg  maxe  fv  (fM  p(q  \  z,  0m- 1)  log p(q,  z  \  6)  dq)  n (z)  dz  (60) 
=  arg  rnaxg  H(0,  Gm_ i)  =  arg  maxo  I' (6) 

This  essentially  suggests  a  coordinate  ascent  that  alternates  between  6  and 
Q(q)  which  is  guaranteed  to  converge  to  a  local  maximum  [57,59].  We  show  in 
the  sequel  that  the  first  and  second  order  derivatives  of  H(9,  0m-\)  coincide 
with  those  of  I'{0).  As  a  result  they  have  a  unique  and  identical  maximum. 

In  particular,  substituting  from  Equation  (3),  we  obtain: 


i)  =  Sv  ( JMp(q  |  z,em_  i)(-|8  (l/iq.  z)  -  A(z;6))  -  logZ(fl))  *(z)d) 


Jv  {Er(, b,«„-0  [~P  E(q ,*)  -  A(z-,  9)) 


ir(z)dz  —  log  Z(G) 
(61) 


Using  Equation  (14)  and  the  expansion  of  Equation  (7)  for  A(z;  9),  we  obtain 
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the  gradient  of  X7qH(9,  0m-i)  with  respect  to  9: 


dH{e,em.1)  _  f  ndA(z;0)  ,  w  _  aio gZ(0) 
ae,  JvP  ae „•  7rvzJaz  as. 


/^(Z) 


&4(z;0) 

ae 


P 


a  log  z(o) 
ae. 


(62) 


/?(£»«  Ml) 


Furthermore,  the  Hessian  of  the  objective  function  H(9 ,  0m_i) 
to  the  covariance  between  the  kernels  i.e. : 

a2H{ e,em-i)  _  a\ogz(o) 

aOjdei  dOjdOi 


is  proportional 


(63) 


=  -D  2Covrw)[Kj,Kl] 


A  Bayesian  extension  to  the  aforementioned  formulation  could  be  readily  ob¬ 
tained  by  the  introduction  of  a  prior  density  on  9 ,  i.e.  p(9).  In  this  case  the 
objective  function  I' (9)  =  —1(9)  (Equation  (11))  can  be  interpreted  as  the 
limiting  log- likelihood  in  the  case  of  infinite  “observations”  {z j}”=1  (n  — >  oo) 
from  the  uniform  density  ir(z)  on  T>,  i.e.: 

log p(zi  I  9)  -»■ 

i—  1 

Hence  the  log-posterior  (conditioned  on  the  “observations  “  zp  would  be: 

1(9)  =  log p(9  |  (zjljFj00)  oc  /  7r(z)logp(z  |  9)  dz  +  \og p(9) 

'  ’  lo9~prior  (65) 

=  I'(9)  +  logp(0) 

A  maximization  of  the  log-posterior  1(9)  amounts  to  a  MAP  (Maximum  A 
Posteriori)  point  estimate  of  9.  The  gradient  and  Hessian  of  I (9)  would  then  be 
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7r(z)  logp(z  |  9)  dz 


(64) 
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penalized/regularized  versions  of  the  respective  gradient  and  Hessian  of  I'{0). 
Appropriate  prior  modeling  could  provide  interesting  extensions  in  the  context 
of  sparse  representations  ([27,73]).  Prior  modeling  could  also  be  extended  ed 
to  the  remaining  parameters  of  the  expansion  e.g.  kernel  bandwidths. 
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